Before starting this project I did some research to get some insight into white wines. The chemicals in wine make up only 2% of its composition; the other 98% is water and alcohol. Some studies have shown that no specific chemical is directly related to quality. Knowing this, I am prepared for the plots not to show any great differences.
library(ggplot2)
library(GGally)
library(scales)
library(lattice)
library(MASS)
library(memisc)
library(RColorBrewer)
library(gridExtra)
wines = read.csv("wineQualityWhites.csv")
wines$quality <- as.factor(wines$quality)
Once I started plotting, I realized that the default colors were inconclusive and not helpful for identifying differences, because they were all fairly similar shades of blue. While trying to change the colors, I found that the quality variable had to be converted to a factor so that a different color could be assigned to each level.

# Exploring the data

## Univariate Analysis

Before plotting, it is important to study each variable on its own in order to get some insight into our data.
summary(wines)
X             fixed.acidity   volatile.acidity  citric.acid     residual.sugar  chlorides        free.sulfur.dioxide
Min.   :   1  Min.   : 3.800  Min.   :0.0800    Min.   :0.0000  Min.   : 0.600  Min.   :0.00900  Min.   :  2.00
1st Qu.:1225  1st Qu.: 6.300  1st Qu.:0.2100    1st Qu.:0.2700  1st Qu.: 1.700  1st Qu.:0.03600  1st Qu.: 23.00
Median :2450  Median : 6.800  Median :0.2600    Median :0.3200  Median : 5.200  Median :0.04300  Median : 34.00
Mean   :2450  Mean   : 6.855  Mean   :0.2782    Mean   :0.3342  Mean   : 6.391  Mean   :0.04577  Mean   : 35.31
3rd Qu.:3674  3rd Qu.: 7.300  3rd Qu.:0.3200    3rd Qu.:0.3900  3rd Qu.: 9.900  3rd Qu.:0.05000  3rd Qu.: 46.00
Max.   :4898  Max.   :14.200  Max.   :1.1000    Max.   :1.6600  Max.   :65.800  Max.   :0.34600  Max.   :289.00

total.sulfur.dioxide  density         pH             sulphates       alcohol        quality
Min.   :  9.0         Min.   :0.9871  Min.   :2.720  Min.   :0.2200  Min.   : 8.00  3:  20
1st Qu.:108.0         1st Qu.:0.9917  1st Qu.:3.090  1st Qu.:0.4100  1st Qu.: 9.50  4: 163
Median :134.0         Median :0.9937  Median :3.180  Median :0.4700  Median :10.40  5:1457
Mean   :138.4         Mean   :0.9940  Mean   :3.188  Mean   :0.4898  Mean   :10.51  6:2198
3rd Qu.:167.0         3rd Qu.:0.9961  3rd Qu.:3.280  3rd Qu.:0.5500  3rd Qu.:11.40  7: 880
Max.   :440.0         Max.   :1.0390  Max.   :3.820  Max.   :1.0800  Max.   :14.20  8: 175
                                                                                    9:   5
The summary above shows the distribution of quality in our data in more detail. For the purposes of this project we are going to establish a “logical barrier”: good wines are those with a score of 7 or more, medium wines those scoring 5 or 6, and bad wines everything under 5. I hope this helps us infer how the different variables affect quality.

This grouping was chosen because of the small number of high-grade wines (only 5 wines score a 9, for example): with so few observations, a single outlier can greatly affect the conclusions we draw. If we look for a trend rather than a specific value, however, we will be able to predict how the values of our variables affect the quality of the wine. For example, if fixed.acidity in the good wines tends towards higher values while the medium and bad ones concentrate at lower values, it will be reasonable to assume that a higher fixed.acidity affects the final quality of our product.

Given that our main objective is to assess which variables or characteristics primarily affect the quality of our wine, I think it is important first to know the distribution of our data.
ggplot(aes(x=quality), data = wines) +
geom_bar()
As we can see, the largest group of wines (2198 of 4898, about 45%) scores a 6, and only a small percentage (180 wines, under 4%) scores an 8 or higher. Knowing that these wines are scarce, we need to pay attention to how they appear in the following plots.
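The shares behind the bar chart can be checked with a few lines of base R. This is a minimal sketch that re-enters the per-score counts printed by `summary(wines)` by hand instead of reading the CSV; the names `quality_counts`, `quality_share` and `top_share` are only illustrative.

```r
# Counts per quality score, copied from the summary(wines) output
quality_counts <- c("3" = 20, "4" = 163, "5" = 1457, "6" = 2198,
                    "7" = 880, "8" = 175, "9" = 5)

# Share of each score, in percent
quality_share <- round(100 * quality_counts / sum(quality_counts), 1)
print(quality_share)

# Share of wines scoring 8 or higher, as a fraction of all wines
top_share <- sum(quality_counts[c("8", "9")]) / sum(quality_counts)
print(round(100 * top_share, 1))
```

With the real data frame, `prop.table(table(wines$quality))` should give the same proportions without typing the counts in.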
wines$classification <- ifelse((wines$quality == 3) | (wines$quality == 4), "bad", ifelse((wines$quality == 5) | (wines$quality == 6), "medium", "good"))
wines$classification <- as.factor(wines$classification)
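As an alternative to the nested `ifelse()`, the same three groups could be built with `cut()`. The sketch below is an assumption-laden illustration, not the method used in this analysis: it runs on a small made-up vector of scores, and `quality_num` is a hypothetical helper variable. One practical difference is that `as.factor()` on the character labels orders the levels alphabetically (bad, good, medium), while `cut()` keeps them in the order bad, medium, good.

```r
# Toy quality scores (made up for illustration), stored as a factor
# the same way wines$quality is after as.factor()
quality <- factor(c(3, 4, 5, 6, 6, 7, 8, 9))

# A factor has to go through as.character() first; as.numeric() on the
# factor itself would return the level indices (1, 2, 3, ...) instead
quality_num <- as.numeric(as.character(quality))

# Bins: 3-4 -> bad, 5-6 -> medium, 7 and above -> good
classification <- cut(quality_num,
                      breaks = c(-Inf, 4, 6, Inf),
                      labels = c("bad", "medium", "good"),
                      ordered_result = TRUE)

# Cross-tabulate scores against the new classes
print(table(quality, classification))
```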
Now that we have created the new variable, it is time to start plotting our data. In this comparison we are going to look at the distribution of each feature in our dataset, with every value colored by its quality or classification. In each plot the feature is on the x-axis and the row index X is on the y-axis, so X only spreads the points out vertically. Once we have found the variables that show a relationship with quality, we will explore them further so we can learn as much as we can from this dataset.

For all the following plots I am using a special set of colors, given that the default ggplot colors did not help to recognise possible relationships in the data. To fix that I am using the RColorBrewer library, which provides multiple color palettes. In this case I am using the Dark2 palette.
p1 <- ggplot(aes(y=X, x = fixed.acidity, color = classification), data = wines) +
geom_point() +
scale_color_brewer(palette = "Dark2")
p2 <- ggplot(aes(y=X, x = fixed.acidity, color = quality), data = wines) +
geom_point() +
scale_color_brewer(palette = "Dark2")
grid.arrange(p1, p2, ncol = 1)
As we can see, most of the values lie between approximately 5 and 8 for fixed.acidity. The behaviour of the data does not seem to change with quality in these two plots. Unfortunately, I do not notice any direct relationship between quality and fixed acidity.
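The same pair of plots is drawn for every feature in this section, so the pattern could be wrapped in a small helper. The sketch below is only an illustration: `plot_feature()` and the `toy` data frame are hypothetical names that do not appear in the original analysis, and the toy values are made up. It assumes ggplot2 (3.0 or later, for the `.data` pronoun) and gridExtra are installed, and it zooms from the feature's minimum to a chosen upper quantile, which is one possible convention rather than the exact limits used in each chunk here.

```r
library(ggplot2)
library(gridExtra)

# Hypothetical helper (not part of the original analysis): builds the
# classification/quality pair of scatter plots for one feature.
# `upper` is the quantile used as the right-hand zoom limit.
plot_feature <- function(data, feature, upper = 1) {
  x_min <- min(data[[feature]])
  x_max <- unname(quantile(data[[feature]], upper))
  make_plot <- function(color_by) {
    ggplot(data, aes(x = .data[[feature]], y = X,
                     color = .data[[color_by]])) +
      geom_point() +
      scale_color_brewer(palette = "Dark2") +
      coord_cartesian(xlim = c(x_min, x_max)) +
      labs(x = feature, color = color_by)
  }
  list(by_class = make_plot("classification"),
       by_quality = make_plot("quality"))
}

# Toy data frame standing in for `wines` (values are made up)
toy <- data.frame(
  X = 1:6,
  fixed.acidity = c(6.3, 6.8, 7.3, 7.0, 6.1, 8.2),
  quality = factor(c(4, 5, 6, 6, 7, 8)),
  classification = factor(c("bad", "medium", "medium",
                            "medium", "good", "good"))
)

plots <- plot_feature(toy, "fixed.acidity", upper = 0.95)
grid.arrange(plots$by_class, plots$by_quality, ncol = 1)
```

With the real data, a call such as `plot_feature(wines, "chlorides", upper = 0.95)` would stand in for one of the repeated chunks.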
p1 <- ggplot(aes(y=X, x = volatile.acidity, color = classification), data = wines) +
geom_point() +
scale_color_brewer(palette = "Dark2")
p2 <- ggplot(aes(y=X, x = volatile.acidity, color = quality), data = wines) +
geom_point() +
scale_color_brewer(palette = "Dark2")
grid.arrange(p1, p2, ncol = 1)
These plots do not show any trend in quality related to volatile.acidity. However, we could speculate that the data is somewhat ordered, given some “layers” around the 1000 and 3000 values of X. These are easier to spot in the first plot because it has fewer colors.
p1 <- ggplot(aes(y=X, x = citric.acid, color = classification), data = wines) +
geom_point() +
scale_color_brewer(palette = "Dark2") +
coord_cartesian(xlim = c(0,quantile(wines$citric.acid, 0.95)))
p2 <- ggplot(aes(y=X, x = citric.acid, color = quality), data = wines) +
geom_point() +
scale_color_brewer(palette = "Dark2") +
coord_cartesian(xlim = c(0,quantile(wines$citric.acid, 0.95)))
grid.arrange(p1, p2, ncol = 1)
The previous plots do not show any particular relationship between citric acid and the quality of the wine: wines of different qualities share the same citric acid values. That does not mean there is no relationship at all between citric acid and quality; it means there is no direct relationship between them. There could still be an indirect one.
p1 <- ggplot(aes(y=X, x = residual.sugar, color = classification), data = wines) +
geom_point() +
scale_color_brewer(palette = "Dark2") +
coord_cartesian(xlim = c(0,quantile(wines$residual.sugar, 0.95)))
p2 <- ggplot(aes(y=X, x = residual.sugar, color = quality), data = wines) +
geom_point() +
scale_color_brewer(palette = "Dark2") +
coord_cartesian(xlim = c(0,quantile(wines$residual.sugar, 0.95)))
grid.arrange(p1, p2, ncol = 1)
These plots show a little more promise, given that there seems to be a larger concentration of high-quality wines with a residual sugar near 0. To look into it, I limited the x-axis to the 95th percentile so that the outliers do not compress the rest of the data. With the outliers out of view, we see a pretty significant build-up of good-quality wines with a residual sugar from 2 to 5. In the next part this variable will be investigated further so we can check its relationship with the quality variable.
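Note that `coord_cartesian()` only zooms the view; the rows beyond the limit stay in the data frame. If the outliers should actually be removed (for example before computing summary statistics), a quantile filter is one option. The sketch below uses a small made-up vector rather than the wine data, and `cutoff` and `trimmed` are illustrative names.

```r
# Toy residual sugar values (made up), with one extreme value
residual_sugar <- c(0.9, 1.2, 1.6, 2.1, 4.8, 5.2, 7.4, 9.9, 12.5, 65.8)

# 95th percentile, the same quantile used as the zoom limit in the plots
cutoff <- quantile(residual_sugar, 0.95)

# Filtering drops the values above the cut-off; zooming would keep them
trimmed <- residual_sugar[residual_sugar <= cutoff]

print(cutoff)
print(length(residual_sugar) - length(trimmed))  # number of values dropped
print(c(mean_all = mean(residual_sugar), mean_trimmed = mean(trimmed)))
```

The difference between the two means shows why it matters whether the extreme values are only hidden or really excluded.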
p1 <- ggplot(aes(y=X, x = chlorides, color = classification), data = wines) +
geom_point() +
scale_color_brewer(palette = "Dark2") +
coord_cartesian(xlim = c(0,quantile(wines$chlorides, 0.95)))
p2 <- ggplot(aes(y=X, x = chlorides, color = quality), data = wines) +
geom_point() +
scale_color_brewer(palette = "Dark2") +
coord_cartesian(xlim = c(0,quantile(wines$chlorides, 0.95)))
grid.arrange(p1, p2, ncol = 1)
From these plots it looks as if there could be a relationship between chlorides and quality, given that there seems to be a concentration of good-quality wines at chloride concentrations between 0.02 and 0.04. However, after some tweaking of the data and a visual comparison of the two plots, each colored by a different factor, we concluded that there is no correlation between the two variables.
p1 <- ggplot(aes(y=X, x = free.sulfur.dioxide, color = classification), data = wines) +
geom_point() +
scale_color_brewer(palette = "Dark2") +
coord_cartesian(xlim = c(0,quantile(wines$free.sulfur.dioxide, 0.95)))
p2 <- ggplot(aes(y=X, x = free.sulfur.dioxide, color = quality), data = wines) +
geom_point() +
scale_color_brewer(palette = "Dark2") +
coord_cartesian(xlim = c(0,quantile(wines$free.sulfur.dioxide, 0.95)))
grid.arrange(p1, p2, ncol = 1)
Like the previous plots, these do not show any special pattern indicating a direct relationship between quality and the amount of free sulfur dioxide.
p1 <- ggplot(aes(y=X, x = total.sulfur.dioxide, color = classification), data = wines) +
geom_point() +
scale_color_brewer(palette = "Dark2") +
coord_cartesian(xlim = c(0,quantile(wines$total.sulfur.dioxide, 0.98)))
p2 <- ggplot(aes(y=X, x = total.sulfur.dioxide, color = quality), data = wines) +
geom_point() +
scale_color_brewer(palette = "Dark2") +
coord_cartesian(xlim = c(0,quantile(wines$total.sulfur.dioxide, 0.98)))
grid.arrange(p1, p2, ncol = 1)
These plots look for a relationship between quality and total sulfur dioxide. As we can see, there does not seem to be any direct relationship between the variables.
p1 <- ggplot(aes(y=X, x = density, color = classification), data = wines) +
geom_point() +
scale_color_brewer(palette = "Dark2") +
coord_cartesian(xlim = c(quantile(wines$density, 0.00),quantile(wines$density, 0.95)))
p2 <- ggplot(aes(y=X, x = density, color = quality), data = wines) +
geom_point() +
scale_color_brewer(palette = "Dark2") +
coord_cartesian(xlim = c(quantile(wines$density, 0.00),quantile(wines$density, 0.95)))
grid.arrange(p1, p2, ncol = 1)
These plots show the relationship between the density of white wine and its quality. However, there does not seem to be any correlation between the two. After some tweaks, I was able to see that the low-quality wines are spread across the different values, and the same happens with the medium-quality ones (5-6).
p1 <- ggplot(aes(y=X, x = pH, color = classification), data = wines) +
geom_point() +
scale_color_brewer(palette = "Dark2") +
coord_cartesian(xlim = c(quantile(wines$pH, 0.00),quantile(wines$pH, 0.98)))
p2 <- ggplot(aes(y=X, x = pH, color = quality), data = wines) +
geom_point() +
scale_color_brewer(palette = "Dark2") +
coord_cartesian(xlim = c(quantile(wines$pH, 0.00),quantile(wines$pH, 0.98)))
grid.arrange(p1, p2, ncol = 1)
As we can see, the different levels of pH do not seem to have any effect on the quality of the wine. It was not even necessary to separate the different qualities to see the lack of correlation, because the values are completely scattered across the plot.
p1 <- ggplot(aes(y=X, x = sulphates, color = classification), data = wines) +
geom_point() +
scale_color_brewer(palette = "Dark2")
p2 <- ggplot(aes(y=X, x = sulphates, color = quality), data = wines) +
geom_point() +
scale_color_brewer(palette = "Dark2")
grid.arrange(p1, p2, ncol = 1)
The sulphate values across our dataset show no relationship between sulphates and the quality of the wines tested. This only reinforces the idea I stated at the start of the project: no specific chemical makes a wine good, but rather a combination of them all.
p1 <- ggplot(aes(y=X, x = alcohol, color = classification), data = wines) +
geom_point() +
scale_color_brewer(palette = "Dark2")
p2 <- ggplot(aes(y=X, x = alcohol, color = quality), data = wines) +
geom_point() +
scale_color_brewer(palette = "Dark2")
grid.arrange(p1, p2, ncol = 1)
This last group of plots shows a slight tendency of the high-quality wines towards the highest alcohol values, while the medium-quality wines tend to have alcohol levels between 8.5 and 11. However, to confirm this correlation we will have to investigate it further.
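One way to follow up on a visual impression like this is to compare a summary statistic per quality score and to compute a rank correlation. The sketch below is a hedged illustration on made-up numbers, not a result from the wine data; `toy`, `quality_num`, `median_by_quality` and `rho` are hypothetical names. Spearman's coefficient is used here as an assumption, because quality is an ordinal score.

```r
# Toy data (made up): alcohol content and quality scores
toy <- data.frame(
  alcohol = c(8.8, 9.2, 9.5, 10.1, 10.4, 11.0, 11.8, 12.4, 12.9),
  quality = factor(c(4, 5, 5, 6, 5, 6, 7, 7, 8))
)

# Numeric copy of the factor, needed for the correlation
toy$quality_num <- as.numeric(as.character(toy$quality))

# Median alcohol for each quality score
median_by_quality <- tapply(toy$alcohol, toy$quality, median)
print(median_by_quality)

# Rank correlation between alcohol and the numeric quality score
rho <- cor(toy$alcohol, toy$quality_num, method = "spearman")
print(rho)
```

On the real data the same two calls would use `wines$alcohol` and a numeric copy of `wines$quality`.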
Now that I have studied all the possible direct relationships with quality, I am going to investigate the plots that seemed promising.

### Alcohol
ggpairs(wines, mapping = aes(color=classification), columns = c("X", "fixed.acidity", "volatile.acidity", "citric.acid", "residual.sugar", "chlorides", "free.sulfur.dioxide", "total.sulfur.dioxide", "density", "pH", "sulphates", "alcohol"))
ggplot(wines, aes(x = density, y = alcohol, color = classification)) +
geom_point()+
coord_cartesian(xlim = c(quantile(wines$density, 0.00),quantile(wines$density, 0.99)))